1 Case study 1: Audience Size

1.1 Data preparation

Loading survey data.

  1. Selecting and renaming only specified variables of interest for clarity.

  2. Handling missing/wrongly filled values of the selected variables

Generally, data may be incorrect if participants were confused by the instructions or were inattentive while taking the survey.

Dropping any respondents who are missing answers for listening to Sirius Radio or Wharton radio, since this data won’t be usable for our analyses. Further, because the Wharton talk show is hosted on the Sirius platform, any subjects who reported listening to the Wharton radio show but not to Sirius Radio were excluded on the basis of input error.

radio <- radio %>% drop_na(sirius) %>% drop_na(wharton) #drop any rows missing sirius/wharton listenership, since that's what we care about most

#remove nonsense subjects: listening to Wharton implies listening to Sirius
radio <- radio[!(radio$sirius == "No" & radio$wharton == "Yes"),]

Cleaning age data to ensure it is a continuous numeric variable. Dropping any data points outside the range of 18 to 110 years, which are likely erroneous; MTurk requires that workers be at least 18 years old, and ages above 110 are exceedingly rare.

#find entries in age with no numerical component
parse_number(radio$age) 
## Warning: 1 parsing failure.
## row col expected actual
## 880  -- a number female
#replace the character entry identified above with NA; parse_number() recovers entries mixing text and numbers
radio <- radio %>% mutate(age = parse_number(age, na="female")) 

#now that age is numeric, drop data outside reasonable age range (must be 18+ to work for mturk):
radio %>% summarise(min=min(age, na.rm=T), max=max(age, na.rm=T))
is.na(radio$age) <- which(radio$age < 18 | radio$age >110)

Cleaning categorical variables by converting them to factors and ordering levels where categories are inherently ranked (e.g. education, income).

#converting categorical data to factors:
radio <- radio %>% mutate(gender = as.factor(gender),
          education = as.factor(education),
          income = as.factor(income),
          sirius = as.factor(sirius),
          wharton = as.factor(wharton))
#listing unique levels to make sure all make sense
radio %>% sapply(levels)

#clean up education:
  #replace unselected education values with NA
is.na(radio$education) <- which(radio$education == "select one") 
  #drop the now unused "select one" level
radio$education <- droplevels(radio$education) 
  #reorder
radio$education<- factor(radio$education, levels=c("Less than 12 years; no high school diploma", "High school graduate (or equivalent)", "Some college, no diploma; or Associate’s degree", "Bachelor’s degree or other 4-year degree", "Graduate or professional degree", "Other"))

#reorder income levels
radio$income<- factor(radio$income, levels=c("Less than $15,000", "$15,000 - $30,000", "$30,000 - $50,000", "$50,000 - $75,000", "$75,000 - $150,000", "Above $150,000"))

Checking the worktime variable to ensure values are reasonable; no changes were needed.

#look at worktime - automatically calculated by site and all seem to make sense, no corrections needed
sum(is.na(radio$worktime))
radio %>% summarise(min=min(worktime, na.rm=T), max=max(worktime, na.rm=T), mean=mean(worktime))
  1. Brief summary

Usable data was obtained from 1757 subjects. Of these, 1.707% of subjects’ data was incomplete (i.e. missing one or more demographic variables). The number of usable responses for each variable is shown below:

Summary statistics for each variable - including frequency of categorical responses, mean range of continuous variables, and counts for missing values - are displayed below:
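The counts and summaries referenced above were presumably generated along these lines (a sketch; `radio` is the cleaned data frame from the chunks above):

```r
#count usable (non-missing) responses for each variable
colSums(!is.na(radio))

#summary statistics: factor frequencies, numeric ranges and means, NA counts
summary(radio)

#percent of subjects missing at least one variable
mean(!complete.cases(radio)) * 100
```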

The distribution of respondents’ ages is plotted below. Ages ranged from 18 to `r max(radio$age, na.rm=T)` years with a mean of 30.439 years.

a <- ggplot(radio, aes(x=age)) + 
  geom_histogram(bins=25, color="white", fill= "darkblue" ) +
  labs(title="Reported age of survey respondents")
a
## Warning: Removed 3 rows containing non-finite values (stat_bin).

Respondents’ reported educational attainment is plotted below.

redu <- radio %>% drop_na(education) #drop missing education values prior to plotting

d<- ggplot(redu) + 
  geom_bar(aes(x=education), stat="count", fill="darkblue") + 
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  ggtitle("Reported education of survey respondents")
d

Respondents most frequently reported having completed some college or an Associate’s degree (42.701% of the 1740 subjects for whom education data was available).

Listening habits of all survey respondents are plotted below. Most respondents reported listening to Sirius Radio stations, which may indicate that MTurk workers who listen to Sirius were more likely to opt in to completing a survey about Sirius Radio.

w<- ggplot(radio, aes(x=sirius, fill=wharton)) + 
  geom_bar(stat="count") + 
  scale_fill_brewer(palette="Blues") + 
  labs(title="Reported listening habits of survey respondents", x="Listen to Sirius Radio") + 
  guides(fill=guide_legend(title="Listen to Wharton Radio Show")) 
w

Reported household income was plotted among those who do and do not listen to Wharton’s radio show. Respondents most frequently reported an annual household income of $30,000-$50,000.

rinc <- radio %>% drop_na(income) #drop missing income values prior to plotting

i<- ggplot(rinc, aes(x=income, fill=wharton)) + 
  geom_bar(stat="count") +
  labs(title="Reported household income and listenership of survey respondents", x="Annual Household Income") + 
  guides(fill=guide_legend(title="Listen to Wharton Radio Show")) 
i

While the number of individuals in each bracket varies considerably across income levels, the count of individuals who listen to Wharton radio appears relatively stable across income brackets. We further probed the possible relationship of income and listening to Wharton’s radio show by calculating the percentage of total respondents in each income bracket who report listening to Wharton radio, plotted below. The income group with the greatest percentage of Wharton radio listeners was those earning over $150,000 per year.
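The per-bracket percentage described above can be derived by grouping on income; a minimal sketch, assuming the `rinc` data frame from the chunk above:

```r
#percent of each income bracket reporting Wharton listenership
pct_inc <- rinc %>% 
  group_by(income) %>% 
  summarise(pct_wharton = mean(wharton == "Yes") * 100)

ggplot(pct_inc, aes(x=income, y=pct_wharton)) + 
  geom_bar(stat="identity", fill="darkblue") + 
  labs(title="Percent of each income bracket listening to Wharton radio", 
       x="Annual Household Income", y="Percent (%)")
```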

The distribution of the time taken by each respondent to complete the survey is plotted below. Respondents’ time to completion ranged from 8 to `r max(radio$worktime, na.rm=T)` seconds with a mean of 22.522 seconds.

t <- ggplot(radio, aes(x=worktime)) + 
  geom_histogram(bins=25, color="white", fill= "darkgreen") + 
  labs(title="Time to survey completion", x = "Worktime (sec)")
t

1.2 Sample properties

1.2.1 Comparison to General Population

Data describing the demographics of the US general population was obtained from the 2014 American Community Survey, available from the US Census Bureau here.

The distributions of age and gender in the present study sample are plotted alongside the 2014 US General population below.

#define API key for loading census data
census_key <- "6df3ff65073680a1ec1a64376da84217f13d7a53"

#load variable names of ACS data
acsvars <- load_variables(2014, "acs1", cache=TRUE)

#load ACS data on sex & age

acsage <- get_acs(geography="us", table = "B01001", year=2014, key=census_key, survey="acs1")
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Getting data from the 2014 1-year ACS
#recode variable names and split out gender
usage <- acsvars %>% inner_join(acsage, by=c("name"="variable"), keep=FALSE) %>% select(label, estimate) 
usage$label <- gsub("Estimate!!Total!!","", as.character(usage$label))
usage <- usage %>% separate(label, c("gender", "age"), "!!")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 26].
#keep only relevant age bins (>=18)
keep<- c("18 and 19 years", "20 years", "21 years", "22 to 24 years", "25 to 29 years", "30 to 34 years", "35 to 39 years", "40 to 44 years", "45 to 49 years", "50 to 54 years", "55 to 59 years", "60 and 61 years", "62 to 64 years", "65 and 66 years", "67 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years", "85 years and over")
subage <- usage[usage$age %in% keep,]
#collapse age 20 and age 21 bins to make bin representations more meaningful/consistent
subage$age<- fct_collapse(subage$age, "20 to 21 years"=c("20 years", "21 years"))
subage

#plot US census age and sex vs sample 
samp_age <- radio %>% drop_na(gender, age) 
page1 <- ggplot(samp_age, aes(x=age, fill=gender)) + 
  geom_histogram(bins=20, color="white") +
  labs(title="Survey respondents")
page2 <- ggplot(subage, aes(x=age, y=estimate, fill=gender)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="US population", y="count")

# put p1 and p2 together
comp1 <- ggpubr::ggarrange(page1, page2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp1, top = text_grob("Age and Gender of Survey Respondents vs US General Population", 
                               color = "black", face = "bold", size = 14))

Direct comparison is somewhat difficult because the ACS reports age data in uneven bins (e.g. 60 and 61 years vs 45 to 49 years). However, these data suggest that study respondents tended to be younger than the general US population, with a lower proportion of middle-aged respondents than would be expected from a representative sample. Gender is represented roughly evenly in both populations.

The distribution of reported educational attainment in the study sample is plotted alongside that of the 2014 US general population below.

#load ACS data on education

acsedu <- get_acs(geography="us", table = "B15001", year=2014, key=census_key, survey="acs1")
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Getting data from the 2014 1-year ACS
#recode variable names and split out gender, age and education
usedu <- acsvars %>% inner_join(acsedu, by=c("name"="variable"), keep=FALSE) %>% select(label, estimate) 
usedu$label <- gsub("Estimate!!Total!!","", as.character(usedu$label))
usedu <- usedu %>% separate(label, c("gender", "age", "education"), "!!") %>% mutate(education=as.factor(education))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 13 rows [1, 2, 3,
## 11, 19, 27, 35, 43, 44, 52, 60, 68, 76].
levels(usedu$education)

#keep only rows with education info and reorder factor levels
keep2 <- c("Less than 9th grade", "9th to 12th grade, no diploma", "High school graduate (includes equivalency)", "Some college, no degree", "Associate's degree", "Bachelor's degree", "Graduate or professional degree")
usedu$education<-factor(usedu$education, levels=keep2) #set as factor & define levels
subedu <- usedu[usedu$education %in% keep2,] #drop rows without education levels
#collapse variable levels to match sample
subedu$education<- fct_collapse(subedu$education, "Less than 12 years; no high school diploma"=c("Less than 9th grade", "9th to 12th grade, no diploma"),
                                "Some college, no diploma; or Associate’s degree"=c("Some college, no degree", "Associate's degree"))
#aggregate across gender and age since we just want education
subedu <- subedu %>% group_by(education) %>% summarize(estimate=sum(estimate))

#plot US census education vs sample 
pedu1 <- ggplot(redu) + 
  geom_bar(aes(x=education), stat="count", fill="darkblue") + 
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="Survey respondents")
pedu2 <- ggplot(subedu, aes(x=education, y=estimate)) + 
  geom_bar(stat="identity", fill="darkgreen") +
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="US population", y="count")

# put p1 and p2 together
comp2 <- ggpubr::ggarrange(pedu1, pedu2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp2, top = text_grob("Education of Survey Respondents vs US General Population", 
                               color = "black", face = "bold", size = 14))

Based on the data presented above, it appears that those with some college, an Associate’s, and/or Bachelor’s degree are overrepresented in the study sample compared to the general population. This could reflect the demographics of the MTurk workforce, which are explored below.

The distribution of reported annual household income in the study sample is plotted alongside that of the 2014 US general population below.

#load ACS data on income
acsinc <- get_acs(geography="us", table = "B19037", year=2014, key=census_key, survey="acs1")
## The 1-year ACS provides data for geographies with populations of 65,000 and greater.
## Getting data from the 2014 1-year ACS
#recode variable names and split out age and education
usinc <- acsvars %>% inner_join(acsinc, by=c("name"="variable"), keep=FALSE) %>% select(label, estimate) 
usinc$label <- gsub("Estimate!!Total!!","", as.character(usinc$label))
usinc <- usinc %>% separate(label, c("age", "income"), "!!") %>% mutate(income=as.factor(income)) 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [2, 19,
## 36, 53].
levels(usinc$income)

#keep only rows with income info and reorder factor levels
keep3 <- c("Less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999", "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999", "$35,000 to $39,999", "$40,000 to $44,999", "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999", "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", "$200,000 or more")
usinc$income<-factor(usinc$income, levels=keep3) #define and order levels
subinc <- usinc[usinc$income %in% keep3,] #drop rows without income level
#collapse variable levels to match sample
subinc$income<- fct_collapse(subinc$income, "Less than $15,000"=c("Less than $10,000", "$10,000 to $14,999"), 
                             "$15,000 - $30,000" =c("$15,000 to $19,999", "$20,000 to $24,999", "$25,000 to $29,999"),
                             "$30,000 - $50,000"= c("$30,000 to $34,999", "$35,000 to $39,999", "$40,000 to $44,999"),
                             "$50,000 - $75,000"= c("$50,000 to $59,999", "$60,000 to $74,999"),
                             "$75,000 - $150,000"= c("$75,000 to $99,999", "$100,000 to $124,999", "$125,000 to $149,999"),
                             "Above $150,000" = c("$150,000 to $199,999", "$200,000 or more"))
#aggregate across age since we just want income
subinc <- subinc %>% group_by(income) %>% summarize(estimate=sum(estimate))

#plot US census income vs sample 
pinc1 <- ggplot(rinc) + 
  geom_bar(aes(x=income), stat="count", fill="darkblue") + 
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="Survey respondents")
pinc2 <- ggplot(subinc, aes(x=income, y=estimate)) + 
  geom_bar(stat="identity", fill="darkgreen") +
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="US population", y="count")

# put p1 and p2 together
comp3 <- ggpubr::ggarrange(pinc1, pinc2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp3, top = text_grob("Income of Survey Respondents vs US General Population", 
                               color = "black", face = "bold", size = 14))

Interestingly, the most frequently reported household income in the US was higher than that of our survey respondents ($75,000-$150,000 in the ACS vs $30,000-$50,000 in the study sample). This could reflect the comparatively larger proportion of respondents who are likely to be college students (“Some college, no diploma; or Associate’s degree”) and therefore less likely to hold high-paying, full-time jobs than the general adult population. It may also reflect the demographics of MTurk workers, discussed below.

1.2.2 Comparison to MTurk Population

Data describing the demographics of MTurk workers in the US is from Moss et al.’s Is it Ethical to Use Mechanical Turk for Behavioral Research? Relevant Data from a Representative Survey of MTurk Participants and Wages and was obtained here. Data was collected in 2019.

#load mturk data
mtrk <- read.spss("data/MTurkEthics_Study1_Data.sav", use.value.labels = TRUE, to.data.frame=TRUE, add.undeclared.levels="no")
## Warning in read.spss("data/MTurkEthics_Study1_Data.sav", use.value.labels =
## TRUE, : data/MTurkEthics_Study1_Data.sav: Very long string record(s) found
## (record type 7, subtype 14), each will be imported in consecutive separate
## variables
#subset and rename columns
mtrk <- mtrk %>% select(age, Whatisyourgender, Whatisthehighestlevelofschoolyouhavecompletedorthehighestdegreey, Thinkingbackoverthelastyearwhatwasyourfamilysannualincomeincludi) %>% rename(gender=Whatisyourgender, education=Whatisthehighestlevelofschoolyouhavecompletedorthehighestdegreey, income=Thinkingbackoverthelastyearwhatwasyourfamilysannualincomeincludi)

The distributions of age and gender in the present study sample are plotted alongside those of MTurk workers below.

#filter age to match sample
is.na(mtrk$age) <- which(mtrk$age < 18 | mtrk$age >110)
#drop Other gender to match sample
is.na(mtrk$gender) <- which(mtrk$gender=="Other")

#plot mturk age and sex vs sample 
mage1 <- ggplot(samp_age, aes(x=age, fill=gender)) + 
  geom_histogram(bins=20, color="white") +
  labs(title="Survey respondents")
m_age <- mtrk %>% drop_na(gender, age) #drop na before plotting
mage2 <- ggplot(m_age, aes(x=age, fill=gender)) + 
  geom_histogram(bins=20, color="white") +
  labs(title="MTurk Workers", y="count")

# put p1 and p2 together
comp4 <- ggpubr::ggarrange(mage1, mage2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp4, top = text_grob("Age and Gender of Survey Respondents vs MTURK Workers", 
                               color = "black", face = "bold", size = 14))

Overall, the two groups look similar in terms of age and gender distribution. However, as with the general population, our study sample appears to have a lower proportion of middle-aged (~45 to 65 yrs) individuals than the broader MTurk sample.

The distribution of reported educational attainment in the study sample is plotted alongside highest level of education among MTurk Workers below.

#collapse variable levels to match sample
levels(mtrk$education)
mtrk$education<- fct_collapse(mtrk$education, "Less than 12 years; no high school diploma"=c(levels(mtrk$education)[1:8]),
                              "Some college, no diploma; or Associate’s degree"=c(levels(mtrk$education)[10:12]),
                              "Graduate or professional degree" =c(levels(mtrk$education)[14:16])) 

#plot mturk education vs sample 
medu1 <- ggplot(redu) + 
  geom_bar(aes(x=education), stat="count", fill="darkblue") + 
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="Survey respondents")
m_edu <- mtrk %>% drop_na(education) #drop na before plotting
medu2 <- ggplot(m_edu) + 
  geom_bar(aes(x=education), stat="count", fill="darkgreen") +
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="MTurk Workers", y="count")

# put p1 and p2 together
comp5 <- ggpubr::ggarrange(medu1, medu2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp5, top = text_grob("Education of Survey Respondents vs MTURK Workers", 
                               color = "black", face = "bold", size = 14))

The educational attainment of our sample seems to reflect that of the larger MTurk workforce, the only notable difference being an over-representation of those with some college/an Associate’s degree relative to college graduates, which are about equally common among the MTurk population. This could reflect sampling effects, or it may be that something about the survey’s presentation on the MTurk platform particularly appealed to current college students, making them more likely to opt in.

Finally, the distribution of reported annual household income in the study sample is plotted alongside that of MTurk workers below.

#see levels
levels(mtrk$income)
#mturk income levels are cut off at different values from the sample, so they cannot be collapsed to match

#plot mturk income vs sample 
minc1 <- ggplot(rinc) + 
  geom_bar(aes(x=income), stat="count", fill="darkblue") + 
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="Survey respondents")
m_inc <- mtrk %>% drop_na(income) #drop na before plotting
minc2 <- ggplot(m_inc) + 
  geom_bar(aes(x=income), stat="count", fill="darkgreen") +
  theme(axis.text.x = element_text(angle = 60, hjust=1)) + 
  labs(title="MTurk Workers", y="count")

# put p1 and p2 together
comp6 <- ggpubr::ggarrange(minc1, minc2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p=comp6, top = text_grob("Income of Survey Respondents vs MTURK Workers", 
                               color = "black", face = "bold", size = 14))

Differences between the income bins defined by the present survey and those of the MTurk demographics survey make direct comparison difficult. However, the incomes of respondents in the Wharton survey appear more normally distributed than those of MTurk workers, who are unexpectedly overrepresented in the $100,000-$149,999 income bracket. Further investigation would be needed to identify the root of this divergence, though it may be related to the somewhat greater proportion of middle-aged MTurk workers compared to the study sample.

1.3 Final estimate

The Wharton radio listenership comprised approximately 2.66 million individuals in January 2014, assuming that the study sample’s reported listening habits are representative of the general population.

The aim of the present study was to estimate the audience size of the radio show Business Radio Powered by the Wharton School (Wharton radio), available on Sirius Radio. Audience size was estimated from the proportion of Sirius Radio listeners who reported listening to the Wharton talk show. Data was collected via a survey on Amazon’s Mechanical Turk (MTurk) platform in May 2014. Of 1,757 survey respondents, 1,358 reported listening to Sirius Radio, and 70 of these (5.155%) reported listening to Wharton radio. This ratio was extrapolated to the approximately 51.6 million Sirius listeners in 2014, yielding an estimated 2.66 million Wharton radio listeners. This finding is limited by the sampling methodology, which was restricted to individuals enrolled in MTurk and may not be representative of Sirius Radio listeners.
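As an arithmetic check, the extrapolation reduces to a few lines of R:

```r
#audience estimate: proportion of sampled Sirius listeners who report
#Wharton listening, scaled to the approximate national Sirius audience
p_wharton <- 70 / 1358            #5.155% of sampled Sirius listeners
sirius_audience <- 51.6e6         #approximate Sirius listeners, 2014
round(p_wharton * sirius_audience)  #~2.66 million
```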

1.4 New task

We propose the following study to estimate the audience size of the Wharton Business Radio Show today:

We endeavor to measure the proportion of Sirius Radio listeners who occasionally or regularly listen to the Wharton radio show. Sirius Radio is a subscription-based service, meaning that Sirius already collects contact information and data on its users. We would therefore target our survey efficiently to known Sirius Radio listeners by purchasing subscribers’ email addresses. Subscribers would be sampled randomly (restricted by Sirius to those who have not opted out of such communications), and sample size would be limited by the $500 budget for purchasing email addresses.

Subjects would be sent a survey using Qualtrics, a data collection platform free to members of the Penn community. Subjects would be incentivized to complete the survey with a chance to win one of five $100 Visa gift cards.

In the survey we would ask respondents to report whether they have ever listened to the Wharton Business Radio Show and, if so, how regularly they listen in average episodes per month. By assessing how frequently individuals listen to the Wharton show, we hope to build a more complete picture of the regular listening audience and compare the ratio of occasional Wharton listeners among Sirius users with that of frequent listeners. We would also collect demographics (age, sex, income, education) to assess any skew that may affect generalizability to Sirius listeners overall. Importantly, we would help maintain data quality by making all fields required and restricting entries in each field to the correct data type (drop-downs, characters, or integers, as appropriate).

Finally, we would estimate audience size by multiplying the proportion of Sirius listeners who report ever listening to Wharton radio by the total number of Sirius Radio subscribers. We could further probe the sample by applying the proportion of respondents who listen frequently (e.g. at least 4 times per month) to estimate the number of regular listeners.
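The proposed estimates could be computed as sketched below; all counts here are placeholders for illustration only (actual values would come from the survey responses and from Sirius’s published subscriber numbers):

```r
#hypothetical survey results -- placeholder values only
n_surveyed <- 2000      #subscribers who completed the survey
n_ever     <- 150       #reported ever listening to Wharton radio
n_frequent <- 60        #reported listening >= 4 episodes/month
subscribers <- 34e6     #placeholder for the current Sirius subscriber count

#scale sample proportions to the subscriber base
audience_est <- n_ever / n_surveyed * subscribers
regular_est  <- n_frequent / n_surveyed * subscribers
c(audience = audience_est, regular = regular_est)
```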

2 Case study 2: Women in Science

Are women underrepresented in science in general? How does gender relate to the type of educational degree pursued? Does the number of higher degrees increase over the years? In an attempt to answer these questions, we assembled a data set (WomenData_06_16.xlsx) from NSF about various degrees granted in the U.S. from 2006 to 2016. It contains the following variables: Field (Non-science-engineering (Non-S&E) and sciences (Computer sciences, Mathematics and statistics, etc.)), Degree (BS, MS, PhD), Sex (M, F), Number of degrees granted, and Year.

Our goal is to answer the above questions only through EDA (Exploratory Data Analyses) without formal testing. We have provided sample R-codes in the appendix to help you if needed.

2.1 Data preparation

  1. Understand and clean the data

Notice the data came in as an Excel file. We need to use the package readxl and the function read_excel() to read the data WomenData_06_16.xlsx into R.

First, we load the data using readxl, rename the Field and Number variables, shorten the Field labels, and convert the relevant variables (Field, Degree, Sex) to factors.

# i. Read the data into R
w_df <- read_excel("data/WomenData_06_16.xlsx")

# ii. Clean the name of each variable. 
# Change variable names to `Field`, `Degree`, `Sex`, `Year` and `Number`. 
w_df <- w_df %>% rename("Field" = "Field and sex",
                        "Number" = "Degrees Awarded")

# iii. Set the variable natures properly
str(w_df)

# rename field names to shorter labels
w_df[w_df$Field == "Earth, atmospheric, and ocean sciences", "Field"] <- "Environmental"
w_df[w_df$Field == "Mathematics and statistics", "Field"] <- "Math & Stats"
w_df[w_df$Field == "Agricultural sciences", "Field"] <- "Agriculture"
w_df[w_df$Field == "Biological sciences", "Field"] <- "Biology"
w_df[w_df$Field == "Social sciences", "Field"] <- "Sociology"
w_df[w_df$Field == "Computer sciences", "Field"] <- "Computer"
w_df[w_df$Field == "Physical sciences", "Field"] <- "Physics"

# Changing categorical variables from character to factor type
w_df %<>% mutate(Field = as.factor(Field),
         Degree = as.factor(Degree),
         Sex = as.factor(Sex))

# iv. Any missing values?
summary(w_df)
# no missing values. 

# count the number of fields
length(unique(w_df$Field))

# count the number of degrees
length(unique(w_df$Degree))

# count the number of years reported
length(unique(w_df$Year))
# preparing html output table
summary(w_df)

# nicer table with kableExtra
# kbl(summary(w_df)) %>%
#   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
##            Field     Degree        Sex           Year          Number      
##  Agriculture  : 66   BS :220   Female:330   Min.   :2006   Min.   :   218  
##  Biology      : 66   MS :220   Male  :330   1st Qu.:2008   1st Qu.:  2118  
##  Computer     : 66   PhD:220                Median :2011   Median :  6020  
##  Engineering  : 66                          Mean   :2011   Mean   : 41717  
##  Environmental: 66                          3rd Qu.:2014   3rd Qu.: 18127  
##  Math & Stats : 66                          Max.   :2016   Max.   :781474  
##  (Other)      :264
  1. Write a summary describing the data set provided here.

This dataset comprises 5 variables (Field, Degree, Sex, Year and Number). None of the variables contain NAs. The number of awarded degrees is reported for several fields, across 3 degree types (BS, MS, PhD) and over an 11-year period (2006-2016).

  1. How many fields are there in this data?

There are 10 different fields of study in this dataset.

  1. What are the degree types?

There are 3 degree types.

  1. How many years’ statistics are being reported here?

This dataset reports degrees over an 11-year period (2006-2016).

2.2 BS degrees in 2015

Is there evidence that more males are in science-related fields vs Non-S&E? Provide summary statistics and a plot which shows the number of people by gender and by field. Write a brief summary to describe your findings.

2.2.1 Summary statistics by field and gender

First, we are creating an aggregated table showing the total number of awarded degrees by Field and Sex.

# summarizing by field and sex
agg <- w_df %>% 
  group_by(Field, Sex) %>% 
  summarise(
    Total_Number = sum(Number)
    )
agg

# #  nicer table with kableExtra
# kbl(agg) %>%
#   kable_paper() %>%
#     scroll_box(height = "300px")
## # A tibble: 20 × 3
## # Groups:   Field [10]
##    Field         Sex    Total_Number
##    <fct>         <fct>         <dbl>
##  1 Agriculture   Female       172852
##  2 Agriculture   Male         152956
##  3 Biology       Female       745384
##  4 Biology       Male         516556
##  5 Computer      Female       171808
##  6 Computer      Male         626248
##  7 Engineering   Female       301426
##  8 Engineering   Male        1147112
##  9 Environmental Female        36076
## 10 Environmental Male          53118
## 11 Math & Stats  Female       125083
## 12 Math & Stats  Male         173641
## 13 Non-S&E       Female     11916324
## 14 Non-S&E       Male        7356720
## 15 Physics       Female       120797
## 16 Physics       Male         194365
## 17 Psychology    Female      1109902
## 18 Psychology    Male         326865
## 19 Sociology     Female      1244592
## 20 Sociology     Male        1041377

2.2.2 Plots by field and gender

Here, we are creating a new science vs non-science fields variable, so that we can plot the total number of awarded degrees in science vs non-science fields overall. Next, we plot the total number of awarded degrees by subfields and gender.

# disabling scientific notation for plotting
options(scipen = 999)

# creating new science vs non-science fields variable
w_df$Field_type <- ifelse(w_df$Field == "Non-S&E", "Non-S&E fields", "Science fields")

# getting total number of degrees per field type
w_df %>% group_by(Field_type, Sex) %>%
  summarise(total_number = sum(Number)) %>%
  # bar plots per gender and field type
  ggplot(aes(x = Field_type, y = total_number, fill = Sex, colour = Sex)) + 
    geom_bar(stat = "identity", position = "dodge", alpha = 0.8, width = 0.5) +
    theme_bw() +
    labs(x="Fields of study", y = "Total number") +
    ggtitle("Number of awarded degrees by field type and gender") +
    theme(plot.title = element_text(face="bold", size = 15)) +
    theme(legend.text = element_text(size = 10), 
          legend.title = element_text(size = 15),
          axis.title = element_text(size = 15),
          axis.text.x = element_text(size = 10), 
          axis.text.y = element_text(size = 10))

# getting total number of degrees per field
w_df %>% group_by(Field, Sex) %>%
  summarise(total_number = sum(Number)) %>%
  # bar plots per gender and field, sorted by total number of degrees
  ggplot(aes(x = forcats::fct_reorder(Field, total_number), y = total_number, fill = Sex, colour = Sex)) + 
    geom_bar(stat = "identity", position = "dodge", alpha = 0.8, width = 0.5) +
    theme_bw() +
    labs(x="Fields of study", y = "Total number") +
    ggtitle("Number of awarded degrees by field of study and gender") +
    theme(plot.title = element_text(face="bold", size = 15)) +
    theme(legend.text = element_text(size = 10), 
          legend.title = element_text(size = 15),
          axis.title = element_text(size = 15),
          axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
          axis.text.y = element_text(size = 10))

Based on the above evidence, the number of men in Science fields appears comparable to the number of women. This contrasts with Non-S&E fields, which comprise a much larger number of women. This finding should, however, be nuanced by the fact that subfields within the sciences show very different male/female proportions. For instance, Biology and Psychology have a majority of female students, while Computer sciences and Engineering have a majority of male students.

An additional observation is that the number of students in Non-S&E fields overall is much larger than in Science fields. To further improve the visualization of gender proportions, we created a graph displaying the proportion of each gender (rather than the total number) in each field.

# getting total number of degrees per field type
w_df %>% group_by(Field_type, Sex) %>%
  summarise(total_number = sum(Number)) %>%
  # proportion bar plots per gender and field type
  ggplot(aes(x = Field_type, y = total_number, fill = Sex, colour = Sex)) + 
    geom_bar(stat = "identity", position = "fill", alpha = 0.8, width = 0.5) +
    theme_bw() +
    labs(x="Fields of study", y = "Proportion") +
    ggtitle("Proportion of awarded degrees by sex and field type") +
    theme(plot.title = element_text(face="bold", size = 15)) +
    theme(legend.text = element_text(size = 10), 
          legend.title = element_text(size = 15),
          axis.title = element_text(size = 15),
          axis.text.x = element_text(size = 10), 
          axis.text.y = element_text(size = 10))

# getting total number of degrees per field 
w_df %>% group_by(Field, Sex) %>%
  summarise(total_number = sum(Number)) %>%
  arrange(desc(total_number))  %>%
  ggplot(aes(x = fct_reorder(Field, total_number), y = total_number, fill = Sex, colour = Sex)) + 
    geom_bar(stat = "identity", position = "fill", alpha = 0.8, width = 0.5) +
    theme_bw() +
    labs(x="Fields of study", y = "Proportion") +
    ggtitle("Proportion of awarded degrees by sex and field of study") +
    theme(plot.title = element_text(face="bold", size = 15)) +
    theme(legend.text = element_text(size = 10), 
          legend.title = element_text(size = 15),
          axis.title = element_text(size = 15),
          axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
          axis.text.y = element_text(size = 10))

These plots confirm that the proportion of females is higher in Non-S&E fields, but is roughly 50% in Science fields. Among the science subfields, Computer Sciences, Engineering, Physical Sciences, Mathematics and Statistics, and Environmental Sciences have a clearly larger proportion of males. Thus, while there is no evidence of a larger proportion of males when all science fields are considered together, the subfields show striking differences in gender proportions.

2.3 EDA bringing type of degree, field and gender in 2015

Describe the number of people by type of degree, field, and gender. Do you see any evidence of gender effects over different types of degrees? Again, provide graphs to summarize your findings.

Here, we create a plot showing the number of awarded degrees by degree type, field, and gender, filtering the data to the year 2015.

# plot showing the number of awarded degrees in 2015 by field and degree type, colored by gender
w_df %>%
  filter(Year == 2015) %>%
  ggplot(aes(x = forcats::fct_reorder(Field, Number), y = Number, fill = Sex, colour = Sex)) + 
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8, width = 0.5) +
  facet_grid(Degree~., scales = "free_y") +
  theme_bw() +
  labs(x="Fields of study") +
  ggtitle("Number of awarded degrees in 2015 by type of degree, field and gender") +
  theme(plot.title = element_text(face="bold", size = 15)) +
  theme(legend.text = element_text(size = 10), 
        legend.title = element_text(size = 15),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
        axis.text.y = element_text(size = 10))

In 2015, the number of awarded bachelor's degrees is larger overall than that of master's and PhD degrees. Gender proportions are mostly similar across degree types within a given subfield. However, striking exceptions in this plot are Engineering and Physical Sciences, where the number of males is much higher at the PhD level, as well as Psychology, which has a higher number of women at the PhD level.

2.4 EDA bringing all variables

In this last portion of the EDA, we ask you to provide evidence numerically and graphically: Do the number of degrees change by gender, field, and time?

2.4.1 Summary statistics by field, gender and time

First, we are providing a numeric summary of the percentage of female students per field type, degree type and year.

# summarizing by field type, degree, time and sex
agg <- w_df %>% 
  group_by(Field_type, Sex, Year, Degree) %>% 
  summarise(total_number = sum(Number)) %>%
  group_by(Field_type, Year, Degree) %>% 
  mutate(percent = (total_number / sum(total_number)) * 100) %>%
  filter(Sex == "Female") %>%
  group_by(Field_type, Year, Degree) %>% 
  summarise("Percent female" = percent)
agg

# nicer table with kableExtra
# kbl(agg) %>%
#   kable_paper() %>%
#     scroll_box(height = "300px", width = "50%")
## # A tibble: 66 × 4
## # Groups:   Field_type, Year [22]
##    Field_type      Year Degree `Percent female`
##    <chr>          <dbl> <fct>             <dbl>
##  1 Non-S&E fields  2006 BS                 61.1
##  2 Non-S&E fields  2006 MS                 63.8
##  3 Non-S&E fields  2006 PhD                59.9
##  4 Non-S&E fields  2007 BS                 60.8
##  5 Non-S&E fields  2007 MS                 64.4
##  6 Non-S&E fields  2007 PhD                61.5
##  7 Non-S&E fields  2008 BS                 60.6
##  8 Non-S&E fields  2008 MS                 64.3
##  9 Non-S&E fields  2008 PhD                61.8
## 10 Non-S&E fields  2009 BS                 60.4
## # … with 56 more rows

We can observe that overall, the percentage of women in Science fields tends to be closer to parity at the BS level in most fields, but decreases at the MS and PhD levels.
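This observation can be checked numerically; as a quick sketch (reusing the `agg` table created in the chunk above), we can average the yearly percentages per field type and degree level:

```r
# mean percent female per field type and degree level, averaged over years;
# a drop from BS to MS/PhD in Science fields would support the observation above
agg %>%
  group_by(Field_type, Degree) %>%
  summarise(mean_percent_female = mean(`Percent female`))
```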

Next, we are providing a numeric summary of the percentage of female students per subfield, degree type and year.

# summarizing by field, degree, time and sex
agg <- w_df %>% 
  group_by(Field, Sex, Year, Degree) %>% 
  summarise(total_number = sum(Number)) %>%
  group_by(Field, Year, Degree) %>% 
  mutate(percent = (total_number / sum(total_number)) * 100) %>%
  filter(Sex == "Female") %>%
  group_by(Field, Year, Degree) %>% 
  summarise("Percent female" = percent)
agg

# nicer table with kableExtra
# kbl(agg) %>%
#   kable_paper() %>%
#     scroll_box(height = "300px", width = "50%")
## # A tibble: 330 × 4
## # Groups:   Field, Year [110]
##    Field        Year Degree `Percent female`
##    <fct>       <dbl> <fct>             <dbl>
##  1 Agriculture  2006 BS                 51.5
##  2 Agriculture  2006 MS                 52.5
##  3 Agriculture  2006 PhD                41.0
##  4 Agriculture  2007 BS                 50.4
##  5 Agriculture  2007 MS                 54.9
##  6 Agriculture  2007 PhD                41.0
##  7 Agriculture  2008 BS                 51.2
##  8 Agriculture  2008 MS                 54.6
##  9 Agriculture  2008 PhD                41.2
## 10 Agriculture  2009 BS                 51.3
## # … with 320 more rows

We can observe that the percent female tends to remain stable over time in any given field, and decreases somewhat at the PhD level.

Next, we are providing a summary of the difference in percent female per field and degree type between the years 2006 and 2016.

# summarizing the percent change from 2006-2016 by field, degree and sex
agg <- w_df %>% 
  group_by(Field, Sex, Year, Degree) %>% 
  summarise(total_number = sum(Number)) %>%
  group_by(Field, Year, Degree) %>% 
  mutate(percent = (total_number / sum(total_number)) * 100) %>%
  filter(Sex == "Female" & (Year == "2006" | Year == "2016")) %>%
  group_by(Field, Degree) %>% 
  mutate(percent_change = percent - lag(percent)) %>%
  filter(Year == "2016") %>%
  group_by(Field, Degree) %>% 
  summarise("2006-2016 percent change in females" = round(percent_change, 1))
agg

# nicer table with kableExtra
# kbl(agg) %>%
#   kable_paper() %>%
#     scroll_box(height = "300px", width = "50%")
## # A tibble: 30 × 3
## # Groups:   Field [10]
##    Field       Degree `2006-2016 percent change in females`
##    <fct>       <fct>                                  <dbl>
##  1 Agriculture BS                                       4  
##  2 Agriculture MS                                       4.4
##  3 Agriculture PhD                                      5.8
##  4 Biology     BS                                      -1.2
##  5 Biology     MS                                      -0.9
##  6 Biology     PhD                                      3.5
##  7 Computer    BS                                      -2  
##  8 Computer    MS                                       3.8
##  9 Computer    PhD                                     -1.6
## 10 Engineering BS                                       1.4
## # … with 20 more rows

We can observe that overall, the percent of females receiving a degree in various fields remains stable, with a maximum of 5.8% change in Agricultural Sciences at the PhD level (showing an increase of women at the PhD level over time in this subfield).

2.4.2 Plots by field, gender and time

In this section, we are plotting the data summarized numerically above. First, we plot overall science vs non-science fields by degree type and time.

# plot by field type, degree, time and sex
w_df %>% group_by(Field_type, Sex, Year, Degree) %>%
  summarise(total_number = sum(Number)) %>%
  ggplot(aes(x = Year, y = total_number, fill = Sex)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(Degree ~ Field_type, scales = "free_y") +
  theme_bw() +
  labs(x="Year", y = "Proportion") +
  ggtitle("Proportion of awarded degrees by sex, degree and field type") +
  theme(plot.title = element_text(face="bold", size = 15)) +
  theme(legend.text = element_text(size = 10), 
        legend.title = element_text(size = 15),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 10), 
        axis.text.y = element_text(size = 10))

Overall, Science fields seem to have roughly 50% women, indicating a roughly equal representation of males and females when considering all science fields together. Of note, we observe a progressive increase in the proportion of males at the MS and PhD levels, compared to BS which has a roughly equal representation of both genders.

Next, we repeat the same plotting strategy for the individual science subfields to obtain a more granular level of information.

# plot by subfield, degree, time and sex
w_df %>% group_by(Field, Sex, Year, Degree) %>%
  summarise(total_number = sum(Number)) %>%
  ggplot(aes(x = Year, y = total_number, fill = Sex)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(Degree ~ Field, scales = "free_x") +
  ggtitle("Proportion of awarded degrees by sex, degree and field of study") +
  theme_bw() +
  labs(x="Time", y = "Proportion") +
  theme(plot.title = element_text(face="bold", size = 15)) +
  theme(legend.text = element_text(size = 10), 
        legend.title = element_text(size = 15),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 5), 
        axis.text.y = element_text(size = 10)) +
  scale_x_continuous(breaks = seq(2006, 2016, 10)) +
  theme(panel.spacing.x = unit(1, "mm"))

These graphs again highlight that even though Science fields overall comprise roughly 50% women, this does not hold when considering science subfields at a more granular level. Specifically, we note the markedly higher proportion of men in Engineering, Computer Sciences, Physical Sciences, Mathematics and Statistics, and Environmental Sciences, and the higher proportion of women in Psychology, Non-S&E, and Biological Sciences. Moreover, gender proportions tend to remain stable across the years (2006-2016), regardless of field type (Non-S&E vs Science fields) or subfield.

2.5 Women in Data Science

Finally, is there evidence showing that women are underrepresented in data science? Data science is an interdisciplinary field of computer science, math, and statistics. You may include year and/or degree.

In this section, we plot the proportion of awarded degrees in data science fields by sex, degree, and time.

# plot by degree, time and sex for data science fields only
w_df %>% group_by(Field, Sex, Year, Degree) %>%
  filter(Field %in% c("Computer", "Math & Stats")) %>%
  summarise(total_number = sum(Number)) %>%
  ggplot(aes(x = Year, y = total_number, fill = Sex)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(Degree ~ Field, scales = "free_x") +
  ggtitle("Proportion of awarded degrees in data science by sex, degree and time") +
  theme_bw() +
  labs(x="Time", y = "Proportion") +
  theme(plot.title = element_text(face="bold", size = 15)) +
  theme(legend.text = element_text(size = 10), 
        legend.title = element_text(size = 15),
        axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 10), 
        axis.text.y = element_text(size = 10)) +
  theme(panel.spacing.x = unit(1, "mm"))

This graph provides clear evidence for an underrepresentation of women in data science: Computer Sciences comprises only around 25% women across all degree types (BS, MS, PhD), and while Mathematics and Statistics comprises around 50-60% women at the BS and MS levels, female representation drops clearly at the PhD level (~25% women).

2.6 Final brief report

Summarize your findings focusing on answering the questions regarding if we see consistent patterns that more males pursue science-related fields. Any concerns with the data set? How could we improve on the study?

When considering all science fields together, the proportion of males and females engaging in science-related fields appears similar and has remained balanced from 2006-2016, with a slight increase in the proportion of males at the MS and PhD levels. However, when considering gender proportions in subfields of science, we notice remarkable variation. Specifically, while a number of subfields appear balanced (Agricultural Sciences, Social Sciences), certain subfields contain a strikingly larger proportion of women (Psychology, Biological Sciences) or men (Computer Sciences, Engineering, Physical Sciences, Mathematics and Statistics, Environmental Sciences). Finally, these gender proportions remained stable across the ten-year period considered (2006-2016).

To further improve on the study, it would be interesting to investigate gender proportions among subfields within the Non-S&E category. Given that large discrepancies in gender proportions were found among science subfields even though the overall proportion seemed balanced, it is possible that gender proportions within Non-S&E subfields also vary considerably compared to the overall proportion reported in this dataset. It would also be interesting to collect this type of data for more recent years to determine whether these trends are now changing or still remain stable. Finally, another avenue to explore would be gender distributions across professions and types of function several years after graduation.

2.7 Appendix

To help out, we have included some R code here as reference. You should make your own chunks filled with text going through each item listed above. Make sure to hide unnecessary outputs/code, etc.

  1. Clean data

  2. A number of sample analyses

3 Case study 3: Major League Baseball

We would like to explore how payroll affects performance among Major League Baseball teams. The data, recording payroll and winning numbers/percentage by team from 1998 to 2014, is prepared in two formats.

Here are the datasets:

- MLPayData_Total.csv: wide format
- baseball.csv: long format

Feel free to use either dataset to address the problems.

3.1 EDA: Relationship between payroll changes and performance

Payroll may relate to performance among Major League Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount by which payroll increased from last year. Let us look into this through EDA.

Create increment in payroll

  1. To describe the increment of payroll in each year there are several possible approaches. Take 2013 as an example:

    • option 1: diff: payroll_2013 - payroll_2012
    • option 2: log diff: log(payroll_2013) - log(payroll_2012)

Explain why the log difference is more appropriate in this setup.

First we’ll read in the data and, to make sure we have what we need, display the first 5 rows.

baseball <- read.csv(file = 'data/baseball.csv') # read in the long-format csv as a data.frame
head(baseball, 5) # display the first 5 rows
##                   team year payroll win_num win_pct
## 1 Arizona Diamondbacks 1998    31.6      65   0.401
## 2 Arizona Diamondbacks 1999    70.5     100   0.617
## 3 Arizona Diamondbacks 2000    81.0      85   0.525
## 4 Arizona Diamondbacks 2001    81.2      92   0.568
## 5 Arizona Diamondbacks 2002   102.8      98   0.605

  1. Create a new variable diff_log = log(payroll_2013) - log(payroll_2012). Hint: use the dplyr::lag() function.

  2. Create a long data table including: team, year, diff_log, win_pct

First, we create a new variable called diff_log, which describes the log difference between a year’s payroll and the previous year’s payroll.

In this setup, it is more appropriate to work with log-transformed data because it is more informative about relative differences than the raw amount of payroll increase. In other words, the same raw change in payroll represents a more significant increase for a team with a smaller payroll than for a team with a larger one, so we take differences of the log values to account for this.
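As a toy illustration (numbers invented for this sketch, not taken from the dataset), the same $10M raise yields identical raw differences but very different log differences:

```r
# the same $10M raise on two different payrolls (illustrative values only)
small <- c(before = 50,  after = 60)   # $50M  -> $60M
large <- c(before = 200, after = 210)  # $200M -> $210M

unname(diff(small))  # raw difference: 10 for both teams
unname(diff(large))  # raw difference: 10

log(small["after"]) - log(small["before"])  # ~0.18, i.e. a 20% relative increase
log(large["after"]) - log(large["before"])  # ~0.05, i.e. a 5% relative increase
```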

# create diff_log (log difference in payroll from the previous year) and a change-in-win_pct
# variable that will be useful in 4.2 and later sections; grouping by team ensures lag()
# never compares a team's first year against another team's last year
baseball <- baseball %>%
  group_by(team) %>%
  mutate(diff_log = log(payroll) - log(lag(payroll)),
         diff_win_pct = win_pct - lag(win_pct)) %>%
  ungroup()

# now, we create a new table with our variables of interest
baseball_variables <- baseball %>% 
  select(team, year, diff_log, payroll, win_pct, diff_win_pct) %>%
  na.omit() # each team's first year has no previous year to compare against, so lag() leaves NAs in the diff columns for those rows; they carry no information relevant to our analysis, so we simply omit them

head(baseball_variables, 5) # let's take a look to make sure we've got what we need
##                   team year diff_log payroll win_pct diff_win_pct
## 2 Arizona Diamondbacks 1999   0.8019    70.5   0.617       0.2160
## 3 Arizona Diamondbacks 2000   0.1392    81.0   0.525      -0.0926
## 4 Arizona Diamondbacks 2001   0.0022    81.2   0.568       0.0432
## 5 Arizona Diamondbacks 2002   0.2360   102.8   0.605       0.0370
## 6 Arizona Diamondbacks 2003  -0.2430    80.6   0.519      -0.0864

3.2 Exploratory questions

  1. Which five teams had the highest increase in their payroll between the years 2010 and 2014, inclusive?

  2. Between 2010 and 2014, inclusive, which team(s) “improved” the most? That is, had the biggest percentage gain in wins?

To do this, we will isolate data from the years 2014 and 2010, and create a new variable detailing the difference in pay and win percentage between these years. To see the top 5 teams, we will display the top 5 entries in a new data.frame containing this information.

# to answer these questions, let's create a new data.frame that contains each team, their corresponding log change in payroll between 2010 and 2014, and their corresponding change in win percentage between those same years
baseball_variables_2010_2014 <- data.frame(unique(baseball_variables$team), # note: this construction assumes teams appear in the same order in the 2010 and 2014 subsets, which holds here because the long table is sorted by team and year
                                           log(baseball_variables$payroll[baseball_variables$year == 2014]) - log(baseball_variables$payroll[baseball_variables$year == 2010]),
                                           baseball_variables$win_pct[baseball_variables$year == 2014] - baseball_variables$win_pct[baseball_variables$year == 2010]) 
colnames(baseball_variables_2010_2014) <- c('team', 'diff_log_2014_2010', 'diff_win_pct_2014_2010') # appropriately name the columns

# now, let's see which 5 teams had the largest increase in payroll between 2010 and 2014
baseball_variables_2010_2014 %>%
  arrange(desc(diff_log_2014_2010)) %>% # order rows by descending log pay change
  head(5) # show the top 5
##                   team diff_log_2014_2010 diff_win_pct_2014_2010
## 1  Los Angeles Dodgers              0.908                 0.0864
## 2        Texas Rangers              0.901                -0.1420
## 3     San Diego Padres              0.869                -0.0802
## 4   Pittsburgh Pirates              0.804                 0.1914
## 5 Washington Nationals              0.785                 0.1667
# which 5 teams had the largest increase in win percentage between 2010 and 2014?
baseball_variables_2010_2014 %>%
  arrange(desc(diff_win_pct_2014_2010)) %>% # order rows by descending win_pct change
  head(5) # show the top 5
##                   team diff_log_2014_2010 diff_win_pct_2014_2010
## 1   Pittsburgh Pirates             0.8044                  0.191
## 2    Baltimore Orioles             0.2746                  0.185
## 3 Washington Nationals             0.7853                  0.167
## 4     Seattle Mariners            -0.0661                  0.160
## 5   Kansas City Royals             0.2418                  0.136

As we can see, the top 5 teams with the highest log payroll increase between 2010 and 2014 are:

  1. Los Angeles Dodgers
  2. Texas Rangers
  3. San Diego Padres
  4. Pittsburgh Pirates
  5. Washington Nationals

And the top 5 teams with the highest win percentage increase between 2010 and 2014 are:

  1. Pittsburgh Pirates
  2. Baltimore Orioles
  3. Washington Nationals
  4. Seattle Mariners
  5. Kansas City Royals

3.3 Do log increases in payroll imply better performance?

Is there evidence to support the hypothesis that higher increases in payroll on the log scale lead to increased performance?

Pick up a few statistics, accompanied with some data visualization, to support your answer.

To test this I will run a linear regression analysis, testing how well change in log payroll predicts change in win percentage. I am using change in win percentage as my dependent variable as opposed to win percentage because this provides a picture of how performance responds to our putative predictor of interest.

# run a linear model to evaluate the regression of diff_log on change in win percentage 
baseball_model <- lm(diff_win_pct ~ diff_log, data = baseball_variables)
summary(baseball_model) # lm output
## 
## Call:
## lm(formula = diff_win_pct ~ diff_log, data = baseball_variables)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20317 -0.04727 -0.00255  0.04569  0.26860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.00033    0.00317    0.10    0.917  
## diff_log     0.01606    0.00791    2.03    0.043 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0715 on 507 degrees of freedom
## Multiple R-squared:  0.00808,    Adjusted R-squared:  0.00612 
## F-statistic: 4.13 on 1 and 507 DF,  p-value: 0.0427
# ok, let's plot the lm 
ggplot(baseball_variables, aes(x = diff_log, y = diff_win_pct)) +
  geom_point(col = "steelblue4") +
  stat_smooth(method = "lm", col = "red3") +
  ggtitle("Linear Model | ∆Win Percentage ~ ∆Log Payroll | p=0.0427, Adj. R^2=0.00612") +
  xlab("∆Log Payroll") +
  ylab("∆Win Percentage")
## `geom_smooth()` using formula 'y ~ x'

Based on this analysis, while there is a statistically significant association between log increase in payroll and increase in performance (p = 0.043), the effect size is very small (adj. R-squared = 0.00612). Accordingly, my interpretation at this stage is that although there is some evidence that an increase in payroll may lead to an increase in performance, change in payroll alone does not adequately explain performance; alternatively, the relationship may hold only for some teams, and those teams may be driving the result.

We have established that aggregating across all teams, there is a small predictive relationship between change in log payroll and change in performance. However, we can dive deeper and see if this is true for all teams, or just a subset of teams. It is possible that for some teams there is a stronger predictive relationship than others. To test this, I will conduct a regression for each team, evaluating the change in log payroll as a predictor of change in performance.

# since we'll be running many linear regressions (one for each team), it's best to initialize empty arrays to contain p-value and effect size (adj. R^2) of each regression
p_vector <- rep(NA, length(unique(baseball_variables$team))) # initialize p-value vector
effect_vector <- rep(NA, length(unique(baseball_variables$team))) # initialize effect size vector 
# this loop runs a regression on each team - every iteration corresponds to a single team
for (i in 1:length(unique(baseball_variables$team))){
  temporary_model <- lm(diff_win_pct ~ diff_log, data = baseball_variables %>% filter(team == unique(baseball_variables$team)[i])) # run the regression, select only the rows that pertain to the team of interest
  p_vector[i] <- summary(temporary_model)$coefficients["diff_log",4] # extract the p-value 
  effect_vector[i] <- summary(temporary_model)$adj.r.squared # extract the adjusted R^2
}

# great! now let's create a data.frame that contains p-value, effect size, and team
teamwise_regression_result <- data.frame(team = unique(baseball_variables$team), p_value = p_vector, effect_size = effect_vector)

# visualize results
# create a bar plot showing each team's p-value in their regression
p_plot <- ggplot(data=teamwise_regression_result, aes(x=team,y=p_value, fill=team)) + geom_bar(stat='identity') +
  theme(axis.text.x = element_text(angle = -90)) +
  geom_hline(yintercept = 0.05, color = 'red') + 
  ggtitle("Teamwise ∆Win Percentage ~ ∆Log Payroll Regressions p-value") + theme(legend.position = "none", axis.title.x=element_blank()) + ylab("p-value") 
# create a bar plot showing each team's adjusted r-squared value in their regression
rsq_plot <- ggplot(data=teamwise_regression_result, aes(x=team,y=effect_size, fill=team)) + geom_bar(stat='identity') +
  theme(axis.text.x = element_text(angle = -90)) +
  ggtitle("Teamwise ∆Win Percentage ~ ∆Log Payroll Regressions Adj. R^2") + theme(legend.position = "none", axis.title.x=element_blank()) + ylab("Effect Size (Adj. R-Squared)")
gridExtra::grid.arrange(p_plot, rsq_plot, nrow=1) # plot these together

From these results, we can thus infer that this predictive relationship between change in log payroll and change in win percentage is only present in a handful of teams. Specifically, only the following teams' regressions yielded a p-value below 0.05 together with substantial adjusted R^2 values:

  1. Atlanta Braves
  2. Miami Marlins
  3. New York Yankees
  4. Tampa Bay Rays
  5. Toronto Blue Jays

The result of the earlier regression (across all teams) is likely driven by these 5 teams.
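As a sanity check on this claim, one could refit the aggregate regression with those five teams excluded; a sketch (team spellings assumed to match those in `baseball_variables$team`):

```r
# if the aggregate association weakens substantially after dropping these
# five teams, they were likely driving the earlier result
driver_teams <- c("Atlanta Braves", "Miami Marlins", "New York Yankees",
                  "Tampa Bay Rays", "Toronto Blue Jays")
reduced_model <- lm(diff_win_pct ~ diff_log,
                    data = baseball_variables %>% filter(!(team %in% driver_teams)))
summary(reduced_model)$coefficients["diff_log", ] # compare with the full model's estimate and p-value
```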

So, in summary, we can conclude the following: while at the aggregate level there is evidence that an increase in log payroll may lead to an increase in win percentage, this appears to be driven by the five teams listed above, whose win percentages respond robustly to changes in log payroll, while the other teams' do not.

3.4 Comparison

Which set of factors are better explaining performance? Yearly payroll or yearly increase in payroll? What criterion is being used?

In order to make this comparison, we will compare the teamwise regressions we ran above (change in log payroll predicting change in win percentage) with three further teamwise setups: raw payroll predicting change in win percentage, raw payroll predicting overall win percentage, and change in log payroll predicting overall win percentage. This will show whether one variable better explains performance itself, change in performance, or both. In all, we will compare four analytic setups.

The criteria we will use are the effect size (adjusted R^2) and statistical significance (p-value) of the regression of each dependent variable (win percentage vs change in win percentage) on each independent variable (payroll vs change in log payroll). We will compare how many teams achieve statistical significance and a robust effect size in each analysis.
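Once all four teamwise result tables exist (analysis 1 above produced `teamwise_regression_result`; analyses 2-4 below produce the `_2`, `_3`, and `_4` counterparts), the comparison by this criterion can be sketched as a simple count of teams clearing the 0.05 threshold in each setup:

```r
# count the teams reaching p < 0.05 in each of the four analytic setups
setups <- list("dWin ~ dLogPay" = teamwise_regression_result,
               "dWin ~ Pay"     = teamwise_regression_result_2,
               "Win ~ Pay"      = teamwise_regression_result_3,
               "Win ~ dLogPay"  = teamwise_regression_result_4)
sapply(setups, function(res) sum(res$p_value < 0.05))
```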

# ----------------- #
#    Analysis 2     #
# ----------------- #

# Let's run the analysis using Payroll to predict change in win percentage
p_vector_2 <- rep(NA, length(unique(baseball_variables$team))) # initialize p-value vector
effect_vector_2 <- rep(NA, length(unique(baseball_variables$team))) # initialize effect size vector 
# this loop runs a regression on each team - every iteration corresponds to a single team
for (i in 1:length(unique(baseball_variables$team))){
  temporary_model <- lm(diff_win_pct ~ payroll, data = baseball_variables %>% filter(team == unique(baseball_variables$team)[i])) # run the regression, select only the rows that pertain to the team of interest
  p_vector_2[i] <- summary(temporary_model)$coefficients["payroll",4] # extract the p-value 
  effect_vector_2[i] <- summary(temporary_model)$adj.r.squared # extract the adjusted R^2
}
# great! now let's create a data.frame that contains p-value, effect size, and team
teamwise_regression_result_2 <- data.frame(team = unique(baseball_variables$team), p_value = p_vector_2, effect_size = effect_vector_2)
# Save plots
# create a bar plot showing each team's p-value in their regression
p_plot_2 <- ggplot(data=teamwise_regression_result_2, aes(x=team,y=p_value, fill=team)) + geom_bar(stat='identity') +
  theme(axis.text.x = element_text(angle = -90)) +
  geom_hline(yintercept = 0.05, color = 'red') + 
  ggtitle("Teamwise ∆Win Percentage ~ Payroll Regressions p-value") + theme(legend.position = "none", axis.title.x=element_blank(), axis.text.x=element_blank()) + ylab("p-value") 
# create a bar plot showing each team's adjusted r-squared value in their regression
rsq_plot_2 <- ggplot(data=teamwise_regression_result_2, aes(x=team,y=effect_size, fill=team)) + geom_bar(stat='identity') +
  theme(axis.text.x = element_text(angle = -90)) +
  ggtitle("Teamwise ∆Win Percentage ~ Payroll Regressions Adj. R^2") + theme(legend.position = "none", axis.title.x=element_blank(), axis.text.x=element_blank()) + ylab("Adj. R-Squared")

# ----------------- #
#    Analysis 3     #
# ----------------- #

# now, let's run the analysis using Payroll to predict Win Percentage
p_vector_3 <- rep(NA, length(unique(baseball_variables$team))) # initialize p-value vector
effect_vector_3 <- rep(NA, length(unique(baseball_variables$team))) # initialize effect size vector 
# this loop runs a regression on each team - every iteration corresponds to a single team
for (i in 1:length(unique(baseball_variables$team))){
  temporary_model <- lm(win_pct ~ payroll, data = baseball_variables %>% filter(team == unique(baseball_variables$team)[i])) # run the regression, select only the rows that pertain to the team of interest
  p_vector_3[i] <- summary(temporary_model)$coefficients["payroll",4] # extract the p-value 
  effect_vector_3[i] <- summary(temporary_model)$adj.r.squared # extract the adjusted R^2
}
# great! now let's create a data.frame that contains p-value, effect size, and team
teamwise_regression_result_3 <- data.frame(team = unique(baseball_variables$team), p_value = p_vector_3, effect_size = effect_vector_3)
# visualize results
# create a bar plot showing each team's p-value in their regression
p_plot_3 <- ggplot(data = teamwise_regression_result_3, aes(x = team, y = p_value, fill = team)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.05, color = 'red') + # significance threshold
  ggtitle("Teamwise Win Percentage ~ Payroll Regressions p-value") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("p-value")
# create a bar plot showing each team's adjusted r-squared value in their regression
rsq_plot_3 <- ggplot(data = teamwise_regression_result_3, aes(x = team, y = effect_size, fill = team)) +
  geom_bar(stat = 'identity') +
  ggtitle("Teamwise Win Percentage ~ Payroll Regressions Adj. R^2") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("Adj. R-Squared")

# ----------------- #
#    Analysis 4     #
# ----------------- #

# finally, let's test how well change in log payroll predicts win percentage
teams_4 <- unique(baseball_variables$team)
p_vector_4 <- rep(NA, length(teams_4)) # initialize p-value vector
effect_vector_4 <- rep(NA, length(teams_4)) # initialize effect-size vector
# each iteration of this loop fits a regression for a single team
for (i in seq_along(teams_4)) {
  # fit the model on only the rows that pertain to the team of interest
  temporary_model <- lm(win_pct ~ diff_log, data = baseball_variables %>% filter(team == teams_4[i]))
  p_vector_4[i] <- summary(temporary_model)$coefficients["diff_log", 4] # extract the p-value
  effect_vector_4[i] <- summary(temporary_model)$adj.r.squared # extract the adjusted R^2
}
# great! now let's create a data.frame that contains p-value, effect size, and team
teamwise_regression_result_4 <- data.frame(team = unique(baseball_variables$team), p_value = p_vector_4, effect_size = effect_vector_4)
# visualize results
# create a bar plot showing each team's p-value in their regression
p_plot_4 <- ggplot(data = teamwise_regression_result_4, aes(x = team, y = p_value, fill = team)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.05, color = 'red') + # significance threshold
  ggtitle("Teamwise Win Percentage ~ ∆Log Payroll Regressions p-value") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("p-value")
# create a bar plot showing each team's adjusted r-squared value in their regression
rsq_plot_4 <- ggplot(data = teamwise_regression_result_4, aes(x = team, y = effect_size, fill = team)) +
  geom_bar(stat = 'identity') +
  ggtitle("Teamwise Win Percentage ~ ∆Log Payroll Regressions Adj. R^2") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("Adj. R-Squared")
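The four analyses above all repeat the same fit-and-extract pattern. As a sketch of a possible refactor (the helper name `teamwise_regression` is hypothetical, not part of the analysis), one function could serve all four predictors:

```r
# hypothetical helper: fit a per-team regression for any formula and
# return each team's p-value and adjusted R^2 for the named predictor
teamwise_regression <- function(data, formula, predictor) {
  teams <- unique(data$team)
  results <- data.frame(team = teams, p_value = NA_real_, effect_size = NA_real_)
  for (i in seq_along(teams)) {
    model <- lm(formula, data = data[data$team == teams[i], ]) # rows for one team only
    results$p_value[i] <- summary(model)$coefficients[predictor, 4]
    results$effect_size[i] <- summary(model)$adj.r.squared
  }
  results
}
```

For example, `teamwise_regression(baseball_variables, win_pct ~ payroll, "payroll")` should reproduce `teamwise_regression_result_3`.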

# ----------------- #
#    PLOTTING       #
# ----------------- #

# for easier comparison, remake the Analysis 1 plots without team-name axis labels so all panels match
p_plot <- ggplot(data = teamwise_regression_result, aes(x = team, y = p_value, fill = team)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.05, color = 'red') + # significance threshold
  ggtitle("Teamwise ∆Win Percentage ~ ∆Log Payroll Regressions p-value") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("p-value")
# create a bar plot showing each team's adjusted r-squared value in their regression
rsq_plot <- ggplot(data = teamwise_regression_result, aes(x = team, y = effect_size, fill = team)) +
  geom_bar(stat = 'identity') +
  ggtitle("Teamwise ∆Win Percentage ~ ∆Log Payroll Regressions Adj. R^2") +
  theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x = element_blank()) +
  ylab("Adj. R-Squared")

# now, we can visualize all our results and make comparisons
gridExtra::grid.arrange(p_plot, rsq_plot, p_plot_2, rsq_plot_2, p_plot_3, rsq_plot_3, p_plot_4, rsq_plot_4, nrow=4) # plot these together

Based on the p-values and adjusted R-squared values of our analyses:

  1. Change in log payroll significantly (p < 0.05) predicts change in win percentage for 5 teams.
  2. Payroll significantly predicts change in win percentage for 1 team.
  3. Payroll significantly predicts win percentage for 8 teams.
  4. Change in log payroll significantly predicts win percentage for 2 teams.

In short, change in log payroll better predicts *change* in win percentage, while raw payroll better predicts win percentage itself. Which predictor we should use therefore depends on how we define “explaining performance”.
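The per-analysis counts above can also be tallied directly from the result tables (this assumes the four `teamwise_regression_result*` data frames from the analyses are still in the workspace):

```r
# count how many teams clear the p < 0.05 line in each analysis
sapply(list(analysis_1 = teamwise_regression_result,
            analysis_2 = teamwise_regression_result_2,
            analysis_3 = teamwise_regression_result_3,
            analysis_4 = teamwise_regression_result_4),
       function(res) sum(res$p_value < 0.05, na.rm = TRUE))
```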